useR! 2024, Salzburg, Austria
July 9, 2024
Data Quality Problems
Class overlap occurs when instances of more than one class share a common region in the data space and are not clearly separable
This overlap can happen due to:
Inherent Similarity: Natural similarity between classes
Noise: Variability or errors in data collection
Feature Representation: Insufficient or inadequate features to separate classes
Class overlap makes it challenging for classifiers to accurately distinguish between classes
Classifiers struggle to correctly classify instances due to overlapping regions
Higher error rates occur in areas where classes overlap, leading to more instances being misclassified
If the problem of class overlap is not addressed, models may become overly complex, leading to overfitting issues where the model performs well on training data but poorly on unseen data
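To make the link between overlap and error rate concrete, here is a minimal sketch under stated assumptions: two classes drawn from N(-d/2, 1) and N(d/2, 1) with equal priors, classified by a threshold at 0. The setup and the helper name `overlap_error` are illustrative choices, not part of the talk or of clap. Under these assumptions, the error rate of the threshold rule is `pnorm(-d / 2)`, so it rises as the class means move closer together.

```r
# Assumed toy setup: class A ~ N(-d/2, 1), class B ~ N(d/2, 1), equal priors.
# A threshold at 0 misclassifies each class with probability pnorm(-d / 2).
# `overlap_error` is a hypothetical helper used only for this illustration.
overlap_error <- function(d) pnorm(-d / 2)

# Decreasing separation between class means means increasing overlap
seps <- c(4, 2, 1, 0.5)
data.frame(separation = seps, error_rate = overlap_error(seps))
```

Even the best possible threshold cannot avoid these errors, because they come from the shared region itself rather than from the choice of classifier.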
clap R Package
install.packages('clap')
devtools::install_github("pridiltal/clap")
clap Framework
Normalize the columns of the data using the median and IQR.
This prevents variables with large variances from having a disproportionate influence on Euclidean distances.
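The normalization step can be sketched as follows. This is a minimal illustration, assuming each column is centered on its median and divided by its IQR; `robust_scale` is a hypothetical helper and not a function exported by clap, whose internal implementation may differ.

```r
# Hypothetical helper (not part of clap): center each column on its median
# and divide by its interquartile range.
robust_scale <- function(x) {
  x <- as.matrix(x)
  med <- apply(x, 2, median, na.rm = TRUE)
  iqr <- apply(x, 2, IQR, na.rm = TRUE)
  iqr[iqr == 0] <- 1  # assumption: columns with zero IQR are only centered
  sweep(sweep(x, 2, med, "-"), 2, iqr, "/")
}

# Two columns on very different scales
set.seed(1)
dat <- cbind(small = rnorm(100, sd = 1), large = rnorm(100, sd = 1000))
scaled <- robust_scale(dat)

apply(scaled, 2, median)  # each column now has median 0 (up to rounding)
apply(scaled, 2, IQR)     # each column now has IQR 1 (up to rounding)
```

After scaling, both columns contribute on a comparable scale to Euclidean distances, and the median and IQR are less sensitive to outliers than the mean and standard deviation.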
Leader Algorithm (Hartigan, 1975)
Calculate the nearest neighbor distances
Perform clustering using a radius based on the maximum nearest neighbor distance
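The two steps above can be sketched in base R. This is a generic single-pass version of the leader algorithm, written under assumptions: the radius is taken to be exactly the maximum nearest neighbor distance, a point joins the first leader within that radius, and `leader_cluster` is a hypothetical helper. It is not clap's implementation, which may differ in these details.

```r
# Hypothetical helper: single-pass leader clustering.
# A point joins the first leader within `radius`; otherwise it becomes a leader.
leader_cluster <- function(x, radius) {
  x <- as.matrix(x)
  leaders <- integer(0)        # row indices of the current leaders
  cluster <- integer(nrow(x))  # cluster label for every row
  for (i in seq_len(nrow(x))) {
    assigned <- FALSE
    for (k in seq_along(leaders)) {
      if (sqrt(sum((x[i, ] - x[leaders[k], ])^2)) <= radius) {
        cluster[i] <- k
        assigned <- TRUE
        break
      }
    }
    if (!assigned) {
      leaders <- c(leaders, i)
      cluster[i] <- length(leaders)
    }
  }
  cluster
}

# Toy data: two well-separated groups of 20 points each
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))

# Nearest neighbor distance of every point (self-distances excluded)
d <- as.matrix(dist(x))
diag(d) <- Inf
nn_dist <- apply(d, 1, min)

# Assumption: the radius equals the maximum nearest neighbor distance
radius <- max(nn_dist)
cl <- leader_cluster(x, radius)
table(cl)
```

The result of a leader pass depends on the order of the rows, which is one reason to treat this sketch as illustrative only.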
Case Study 1: Biopsy Data on Breast Cancer Patients
clump thickness
uniformity of cell size
uniformity of cell shape
marginal adhesion
single epithelial cell size
bare nuclei (16 values are missing)
bland chromatin
normal nucleoli
mitoses
Classification Task
Benign: tumors generally do not invade surrounding tissue or spread.
Malignant: tumor cells are more likely to spread to other areas of the body.
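For readers who want to inspect the data, the sketch below assumes the case study uses the `biopsy` data set shipped with the MASS package, whose nine attributes (columns `V1` to `V9`) and `class` column match the variables listed above. The slides do not state the data source explicitly, so treat this as an assumption.

```r
# Assumption: the case study data correspond to MASS::biopsy
library(MASS)
data(biopsy)

# Missing values per column; V6 (bare nuclei) is the only incomplete attribute
colSums(is.na(biopsy))

# Class distribution for the benign vs. malignant classification task
table(biopsy$class)

# One simple option (not necessarily the talk's choice): drop incomplete rows
biopsy_complete <- na.omit(biopsy)
```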
Slides created with Quarto, available at prital.netlify.app.